{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Exploration\n",
    "\n",
    "## Goals\n",
    "1. Introduction to Sentiment Analysis use case\n",
    "2. How to quickly understand a real world data set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "A useful application of machine learning is \"sentiment analysis\".  Here we are trying to determine if a person feels positively or negatively about what they're writing about.  One important application of sentiment analysis is for marketing departments to understand what people are saying about them on social media.  Nearly every medium or large company with any sort of social media presence does some sort of sentiment analysis like the task we are about to do.\n",
    "\n",
    "Here we have a collection of tweets from the tech conference SXSW talking about apple brands.  These tweet are hand labeled by humans using a tool I built called [CrowdFlower](https://crowdflower.com).  Our goal is to build a classifier that can generalize the human labels to more tweets.\n",
    "\n",
    "The labels are what's known as training data, and we're going to use it to teach our classifier what text is positive sentiment and what text is negative sentiment.\n",
    "\n",
    "Let's take a look at our data.  Machine learning classes tend to talk mostly about algroithms, but in practice, machine learning practicioners usually spend most of their time looking at their data.\n",
    "\n",
    "This is a real data set, not a toy one, and I've left it uncleanup up so you will have to work through a few of the messy issues that almost always happen in the real world."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product\r\n",
      "\".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.\",iPhone,Negative emotion\r\n",
      "\"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW\",iPad or iPhone App,Positive emotion\r\n",
      "@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,iPad,Positive emotion\r\n",
      "@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,iPad or iPhone App,Negative emotion\r\n",
      "\"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)\",Google,Positive emotion\r\n",
      "@teachntech00 New iPad Apps For #SpeechTherapy And Communication Are Showcased At The #SXSW Conference http://ht.ly/49n4M #iear #edchat #asd,,No emotion toward brand or product\r\n",
      ",,No emotion toward brand or product\r\n",
      "\"#SXSW is just starting, #CTIA is around the corner and #googleio is only a hop skip and a jump from there, good time to be an #android fan\",Android,Positive emotion\r\n",
      "Beautifully smart and simple idea RT @madebymany @thenextweb wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaVOB,iPad or iPhone App,Positive emotion\r\n"
     ]
    }
   ],
   "source": [
    "# Our data file is in ../scikit/tweet.csv\n",
    "# in a Comma Separated Values format\n",
    "# this command uses the shell to print out the first ten lines\n",
    "!head ../scikit/tweets.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, that looks good - if a little messy.  Let's open the file with some python\n",
    "\n",
    "## Loading Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                          tweet_text  \\\n",
      "0  .@wesley83 I have a 3G iPhone. After 3 hrs twe...   \n",
      "1  @jessedee Know about @fludapp ? Awesome iPad/i...   \n",
      "2  @swonderlin Can not wait for #iPad 2 also. The...   \n",
      "3  @sxsw I hope this year's festival isn't as cra...   \n",
      "4  @sxtxstate great stuff on Fri #SXSW: Marissa M...   \n",
      "\n",
      "  emotion_in_tweet_is_directed_at  \\\n",
      "0                          iPhone   \n",
      "1              iPad or iPhone App   \n",
      "2                            iPad   \n",
      "3              iPad or iPhone App   \n",
      "4                          Google   \n",
      "\n",
      "  is_there_an_emotion_directed_at_a_brand_or_product  \n",
      "0                                   Negative emotion  \n",
      "1                                   Positive emotion  \n",
      "2                                   Positive emotion  \n",
      "3                                   Negative emotion  \n",
      "4                                   Positive emotion  \n"
     ]
    }
   ],
   "source": [
    "import pandas as pd    # this loads the pandas library, a very useful data exploration library\n",
    "import numpy as np     # this loads numpy, a very useful numerical computing library\n",
    "\n",
    "# Puts tweets into a data frame\n",
    "df = pd.read_csv('../scikit/tweets.csv') # read the file into a pandas data frame\n",
    "print(df.head())       # print the first few rows of the data frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data frames are pretty cool, for example I can index the column by name.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    .@wesley83 I have a 3G iPhone. After 3 hrs twe...\n",
      "1    @jessedee Know about @fludapp ? Awesome iPad/i...\n",
      "2    @swonderlin Can not wait for #iPad 2 also. The...\n",
      "3    @sxsw I hope this year's festival isn't as cra...\n",
      "4    @sxtxstate great stuff on Fri #SXSW: Marissa M...\n",
      "Name: tweet_text, dtype: object\n"
     ]
    }
   ],
   "source": [
    "tweets = df['tweet_text'] # sets tweets to be the first column, titled 'tweet_text'\n",
    "print(tweets.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check for understanding\n",
    "\n",
    "Some questions that I immediately asked myself (and you should too)\n",
    "1. How many rows are in our data set?\n",
    "2. How many different types of labels are there?  What are they?\n",
    "3. What year was this data collected?\n",
    "\n",
    "If you were my student and you were sitting in front of me, I would make you actually do this.  Unfortunately I can't force you to answer these questions yourself, but you will have more fun and learn more if you do.\n",
    "\n",
    "You will probably need to google around a little to figure out how to use the dataframe to answer these questions.  You can check out the cool pandas tutorial at https://pandas.pydata.org/pandas-docs/stable/10min.html - it will be useful for many things besides this tutorial!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 1: How many rows in the dataset?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(9093,)\n"
     ]
    }
   ],
   "source": [
    "print(tweets.shape) # print the shape of the variable tweets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks like there are 9093 rows in our dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 2: How many different types of labels are there?  What are they?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count                                   9093\n",
       "unique                                     4\n",
       "top       No emotion toward brand or product\n",
       "freq                                    5389\n",
       "Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: object"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we make target the list of labels from the third column\n",
    "target = df['is_there_an_emotion_directed_at_a_brand_or_product']\n",
    "\n",
    "# describe is a cool function for quick data exploration\n",
    "target.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hmmm... looks like there are 4 values for the sentiment of the tweets with \"No emotion toward brand or product\" being the most common."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "No emotion toward brand or product    5389\n",
       "Positive emotion                      2978\n",
       "Negative emotion                       570\n",
       "I can't tell                           156\n",
       "Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interesting, there is a label \"I can't tell\" along with \"Positive emotion\", \"Negative emotion\" and \"No emotion toward brand or product\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 3: What year was this data collected?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.'"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tweets[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hm, it's a 3G iphone, when was that? 2010?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"rt ' It's 4 p.m. and the #iPad2 line at the Apple store is longer and wider – about 250 people! Only one more hour. ' #sxsw\""
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tweets[200]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok - the ipad2 was released in 2011, these tweets must be from 2011.\n",
    "\n",
    "## Data Cleanup\n",
    "\n",
    "If we dig into the data set one thing we'll notice is that some of the tweets are actually empty.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "nan\n"
     ]
    }
   ],
   "source": [
    "print(tweets[6])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is best practice to not change the input data.  It's better to clearly show the ways that you've modified your data in your code.  In this case, we can use pandas to easily pull out the rows where the tweets are empty.  Here we are indexing into our data frame with the results of a pd.notnull function - this notation is really convenient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "fixed_tweets = tweets[pd.notnull(tweets)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also need to remove the same rows of labels so that our \"tweets\" and \"target\" lists have the same length."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "fixed_target = target[pd.notnull(tweets)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take a second to think about why I wrote \n",
    "*fixed_target = target[pd.notnull(tweets)]* instead of *fixed_target = target[pd.notnull(target)]*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Takeaways\n",
    "\n",
    "1. The most important thing to do when building a machine learning model is to actually look at your data.  \n",
    "2. Clean up your data in code, not in the original file\n",
    "\n",
    "## Questions\n",
    "\n",
    "1. How messy is this data?  It was labeled by humans - how many mislabels?\n",
    "2. Why is there a \"Can't Tell\" label - what kind of tweets get that?\n",
    "3. Are all the tweets in English?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
